K-Nearest Neighbors

We'll start off with the usual imports; these should be second nature by now!

We'll be starting with a basic example of how KNN works with an automatically generated dataset, so we'll also need the following imports.

Sklearn's make_blobs() function is used to generate clustered data based on a variety of parameters; this is perfect for looking at how certain algorithms work.

We will also need the KNeighborsClassifier model for our classification task; note that we will need to import from sklearn.neighbors.
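Taken together, the imports for this section might look like the following sketch; the aliases are the usual conventions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Dataset generation and the classifier itself
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
```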

Create and Visualize The Data

To start things off, we'll be creating a dataset with 400 total points clustered into two groups.

Using make_blobs(), we can also set the number of features each data point will have. We'll set the number of features to 2 so we can plot the features easily.
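A call producing such a dataset might look like this; the `random_state` value is an arbitrary choice we've added to make the example reproducible.

```python
from sklearn.datasets import make_blobs

# 400 points, 2 features each, grouped into 2 clusters
X, y = make_blobs(n_samples=400, n_features=2, centers=2, random_state=42)

print(X.shape)  # (400, 2): one row per point, one column per feature
print(y.shape)  # (400,): the cluster label for each point
```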

Conveniently, make_blobs() returns the data already split into X and y, which is perfect for the training step.

However, we may want to create a dataframe from these numpy arrays to be able to visualize what the data actually looks like.

There are two species of fish: red fish and blue fish. A target of 0 represents a red fish and a target of 1 represents a blue fish. For each fish, we have its length and weight.
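One way to build that dataframe, using the length and weight column names from the fish framing (the make_blobs arguments and random_state are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400, n_features=2, centers=2, random_state=42)

# Label the two generated features as length and weight
df = pd.DataFrame(X, columns=["length", "weight"])
df["target"] = y

print(df.head())
```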

Now, let's create a scatter plot with each feature on a separate axis.

Each point will have a color corresponding to its target value, the value that we want to predict.
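A sketch of that plot, reusing the generated data from above; the reversed `bwr` colormap is our choice so that target 0 comes out red and target 1 comes out blue:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400, n_features=2, centers=2, random_state=42)

# c=y colors each point by its target; cmap="bwr_r" maps 0 -> red, 1 -> blue
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr_r")
plt.xlabel("length")
plt.ylabel("weight")
plt.show()
```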

As can be seen in the plot, our data is nicely split into two distinct clusters. Because of this, we can make a pretty reasonable guess about the target value of a new point. Red fish tend to be shorter and heavier, while blue fish are longer and lighter.

Let's train the model and see if it matches our intuition. Since we're only predicting a single point at a time, we'll train the model on the entire dataset.

Fit and Predict

Instead of doing train_test_split, we're going to select our own test points to get a better sense of the model.

Let's predict three distinct points and see what the model decides to output.

  1. (6, 6.5)
  2. (2, 6)
  3. (4, 6)
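Assuming the make_blobs setup described above (with an arbitrary random_state for reproducibility), fitting on the full dataset and predicting the three points might look like this. Which label each point receives depends on where the generated blobs happen to land:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=400, n_features=2, centers=2, random_state=42)

# Train on the entire dataset, since we only predict a few hand-picked points
knn = KNeighborsClassifier()
knn.fit(X, y)

# The three test points from the list above, as (length, weight) pairs
test_points = [(6, 6.5), (2, 6), (4, 6)]
print(knn.predict(test_points))
```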

Success! Each of the predictions matched our intuition!

KNN on Real World Data

We tested the algorithm on a small, artificial dataset with only 2 features, but how well does it perform on real-world data with potentially many features?

We will be using a breast cancer dataset to hopefully find an answer to this question.

Loading and Preprocessing the Data
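A sketch of this step, assuming sklearn's bundled breast cancer dataset and a standard train/test split. The standardization step is our own addition here: KNN relies on distance computations, which are sensitive to features living on very different scales.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 569 samples, 30 numeric features, binary target (malignant vs. benign)
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize features using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```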

Fitting the Model

By default, the KNeighborsClassifier uses a value of K=5.
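Fitting the default classifier might look like the following; the train/test split and random_state are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier()  # n_neighbors defaults to 5
knn.fit(X_train, y_train)

# score() reports mean accuracy on the held-out data
print(knn.score(X_test, y_test))
```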

Looks like the default classifier gave a pretty decent score. However, keep in mind that accuracy alone is not a great metric for judging model performance, and a score is only meaningful relative to a baseline (such as always predicting the majority class).

What value of K?

While the default classifier was decent, we can also adjust the number of neighbors that the model looks at to get potentially better results.
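One way to scan candidate values of K; the range 1 through 20 is an arbitrary choice, and the best K found here depends on the particular train/test split, so it may not match the value reported below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Record the test accuracy for each candidate value of K
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```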

As we can see, adding more neighbors does not equate to higher model accuracy. Each dataset is unique, and we need to select a value for K based on the nuances in the data to get the best performance from our KNN model.

For this particular dataset, a K value of 10 (i.e., look at the 10 nearest data points) gave us the best results on unseen data.